1 Introduction

Natural language processing (NLP), also known as computational linguistics.

Extracting meaning from text algorithmically.

Computers are good at processing text but not at understanding it; humans, by contrast, are good at understanding text but not at processing it.

Objectives

  • Identify the most important words.
  • Quantify relationships and connections between words.

2 Case study:

3 Baladas:

Artists:

1) Manuel Medrano. Songs: Bajo el Agua, Una y otra vez, La Mujer Que Bota Fuego, Si Pudiera, La Distancia.
2) Andres Cepeda. Songs: Lo Mejor Que Hay en mi Vida, Mejor que a ti me va, Besos Usados, Tengo Ganas.
3) Morat. Songs: Como te Atreves, Besos en Guerra, Aprender a Quererte, A donde Vamos.
4) Cara Luna. Songs: Mi Primer Millon, Tabaco Chanel, Pasos Gigantes, Perderme contigo.

https://www.letras.com/manuel-medrano/bajo-el-agua/
https://www.letras.com/manuel-medrano/una-y-otra-vez/
https://www.letras.com/manuel-medrano/la-mujer-que-bota-fuego/
https://www.letras.com/manuel-medrano/si-pudiera/
https://www.letras.com/manuel-medrano/la-distancia/
https://www.letras.com/andres-cepeda/lo-mejor-que-hay-en-mi-vida/
https://www.letras.com/andres-cepeda/mejor-que-a-ti-me-va/
https://www.letras.com/andres-cepeda/1792354/
https://www.letras.com/andres-cepeda/266343/
https://www.letras.com/morat/como-te-atreves/
https://www.letras.com/morat/besos-en-guerra/
https://www.letras.com/morat/aprender-a-quererte/
https://www.letras.com/morat/a-donde-vamos/
https://www.letras.com/bacilos/65340/
https://www.letras.com/bacilos/65341/
https://www.letras.com/bacilos/65342/
https://www.letras.com/bacilos/pasos-de-gigantes/
https://www.letras.com/bacilos/perderme-contigo/

4 Reggaeton:

Artists:

1) Reykon. Songs: El Lider, El Chisme (Remix), Tu Cuerpo Me Llama (Remix), El Error, La Santa, Ginza (Remix), Secretos, Domingo, Imaginandote.
2) J Balvin. Songs: Ay Vamos, 6 AM, Rojo, Culpables, Safari, Mi Gente, No Es Justo, Blanco.
3) Manuel Turizo. Songs: Una Lady Como Tu, Esclavos de tus Besos, La Bachata.
4) Maluma. Songs: Hawai, Borro Cassette, El Perdedor, Addicted, Carnaval, Cosas Pendientes.

https://www.youtube.com/watch?v=3xinCpjWxxU
https://www.youtube.com/watch?v=jgQ2MSwgC6A
https://www.youtube.com/watch?v=C3jp2lid58g
https://www.youtube.com/watch?v=u5KFYnfKgWo
https://www.youtube.com/watch?v=m8JoSkGVsFA
https://www.youtube.com/watch?v=TapXs54Ah3E
https://www.youtube.com/watch?v=yUV9JwiQLog&pp=ygUDNmFt
https://www.youtube.com/watch?v=_tG70FWd1Ds&pp=ygUEcm9qbw%3D%3D
https://www.youtube.com/watch?v=VYtJAuoZxcc&pp=ygUQdW5hIGxhZHkgY29tbyB0dQ%3D%3D
https://www.youtube.com/watch?v=1afoVNPPQCI&pp=ygUUZXNjbGF2byBkZSB0dXMgYmVzb3M%3D
https://www.youtube.com/watch?v=TiM_TFpT_DE&pp=ygUKbGEgYmFjaGF0YQ%3D%3D
https://www.youtube.com/watch?v=ZFwpzIz8eWE&pp=ygUIc2VjcmV0b3M%3D
https://www.youtube.com/watch?v=f7uFHxg6nks&pp=ygUMaW1hZ2luYW5kb3Rl
https://www.youtube.com/watch?v=KIvhiN0WHfY&pp=ygUHZG9taW5nbw%3D%3D
https://www.youtube.com/watch?v=JWESLtAKKlU&pp=ygUGc2FmYXJp
https://www.youtube.com/watch?v=wnJ6LuUFpMo&pp=ygUIbWkgZ2VudGU%3D
https://www.youtube.com/watch?v=2zn4dAuZ2RU&pp=ygULbm8gZXMganVzdG8%3D
https://www.youtube.com/watch?v=8j1xiiAZhIQ&pp=ygUGYmxhbmNv
https://www.youtube.com/watch?v=pK060iUFWXg&pp=ygUGaGF3YWlp
https://www.youtube.com/watch?v=Xk0wdDTTPA0&pp=ygUVYm9ycm8gY2Fzc2V0dGUgbWFsdW1h0gcJCY0JAYcqIYzv
https://www.youtube.com/watch?v=PJniSb91tvo&pp=ygULZWwgcGVyZGVkb3LSBwkJjQkBhyohjO8%3D
https://www.youtube.com/watch?v=pMIHC_cItd4&pp=ygUPYWRkaWN0ZWQgbWFsdW1h
https://www.youtube.com/watch?v=ufa0K9w9z2c&pp=ygUPY2FybmF2YWwgbWFsdW1h
https://www.youtube.com/watch?v=6vPhcRew8hA&pp=ygUQY29zYXMgcGVuZGllbnRlcw%3D%3D

5 Rock:

Artists:

1) Los De Adentro. Songs: Nubes Negras, Quiero Amarte, No Mas, Tal Vez.
2) Kraken. Songs: Fragil al Viento, Vestido de Cristal, America, Silencioso Amor.
3) Aterciopelados. Songs: Baracunata, Florecita Rockera.
4) Caifanes. Songs: Afuera, Viento, No dejes que.
5) Enanitos Verdes. Songs: La Muralla Verde.

https://youtu.be/8_Tc5uP8SL4?si=DvK_wDQjGbTRIebT
https://youtu.be/8hGAklEil10?si=G5-nZzi1n0rdJprE
https://youtu.be/s09hOXaPhJ8?si=YLYijSYH1bMkCWnS
https://youtu.be/HqiX6-f5w-s?si=Ar1XWmjQP62TSuJX
https://youtu.be/1tVF5rpmFM4?si=N9ItbxZCNnGU8tBm
https://youtu.be/I4YtarQbE7U?si=GbTw8n7ih3JVFAyN
https://youtu.be/Pcy_F40W9EM?si=_DXgEk08DYAgLU2F
https://youtu.be/Q3ReRsnYG4I?si=mmNq8bDT1ONyJpL1
https://youtu.be/mqOCHYhRaGY?si=hKE0l0zyVzcQQr8V
https://youtu.be/ARR3gkzX8I0?si=JCz3UO1Uzjz6FijW
https://youtu.be/DNbG5IIA71w?si=v1-D2Rjm-HVVkx_a
https://youtu.be/9KIshSBiojI?si=gZcaeeLWVnf73mxa
https://youtu.be/i17Go6G-siA?si=E5JJA9aPvYxDtITo
https://youtu.be/tYGZ1YCD2YU?si=vIcjZ9h9ZK69-omp

6 Salsa:

Artists:

1) Joe Arroyo. Songs: Rebelión, En Barranquilla me quedo, Tal para cual, Pa'l bailador, Te quiero más.
2) Fruko y sus Tesos. Songs: El Preso, Los Charcos, Cachondea, El Ausente, El Son del Tren.
3) Grupo Niche. Songs: Sin sentimiento, Algo que se quede, Cali pachanguero, Se pareció tanto a ti, Una aventura.
4) Yuri Buenaventura. Songs: No Estoy contigo, ¿Dónde Estás?, Salsa, Tu Cancion, El Guerrero.

https://www.youtube.com/watch?v=7HtWEPfQJxw
https://www.youtube.com/watch?v=ms3VDvksgks
https://www.youtube.com/watch?v=I3eXeVRqtHQ
https://www.youtube.com/watch?v=o0Bn_qVzZvE
https://www.youtube.com/watch?v=NlemaAlPeZs

7 Importing the text

##### import data
suppressMessages(suppressWarnings(library(readr)))
suppressMessages(suppressWarnings(library(tidyverse)))
# warnings are due to non-UTF-8 or empty ("") characters
# UTF-8 (8-bit Unicode Transformation Format) is a character encoding
# capable of encoding all valid Unicode code points
text_Baladas <- read_csv("baladas.txt", col_names = FALSE, show_col_types = FALSE)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
class(text_Baladas)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
text_Baladas <- c(text_Baladas)
class(text_Baladas)
## [1] "list"
text_Baladas <- unlist(text_Baladas)
class(text_Baladas)
## [1] "character"
names(text_Baladas) <- NULL  # important!
head(text_Baladas, n = 3)
## [1] "Quiero volar contigo"    "Muy alto en algún lugar"
## [3] "Quisiera estar contigo"
# Reggaeton
text_Reggaeton <- unlist(c(read_csv("Reggaeton_proyecto.txt", col_names = FALSE, show_col_types = FALSE)))
names(text_Reggaeton) <- NULL
# Rock 
text_Rock_canciones <- unlist(c(read_csv("Rock_canciones.txt", col_names = FALSE, show_col_types = FALSE)))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
names(text_Rock_canciones) <- NULL
# Salsa 
text_Salsa_canciones <- unlist(c(read_csv("Salsa_canciones.txt", col_names = FALSE, show_col_types = FALSE)))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
names(text_Salsa_canciones) <- NULL
##### tidy-format data frame
# Baladas
text_Baladas <- tibble(line = 1:length(text_Baladas), text = text_Baladas)  # tibble instead of data_frame
class(text_Baladas)
## [1] "tbl_df"     "tbl"        "data.frame"
dim(text_Baladas)
## [1] 957   2
head(text_Baladas, n = 3)
## # A tibble: 3 × 2
##    line text                   
##   <int> <chr>                  
## 1     1 Quiero volar contigo   
## 2     2 Muy alto en algún lugar
## 3     3 Quisiera estar contigo
# the text is not normalized
# it has no "structure" to analyze yet
# Reggaeton
text_Reggaeton<- tibble(line = 1:length(text_Reggaeton), text = text_Reggaeton)
# Rock songs
text_Rock_canciones<- tibble(line = 1:length(text_Rock_canciones), text = text_Rock_canciones)
# Salsa songs
text_Salsa_canciones<- tibble(line = 1:length(text_Salsa_canciones), text = text_Salsa_canciones)
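The repeated parsing warnings above can be inspected with `problems()`, and often disappear once the file encoding is declared explicitly. A minimal sketch with readr; the "Latin1" value is an assumption about how the files were saved, adjust it to your data:

```r
library(readr)

# declare the encoding explicitly; Spanish text saved on Windows is often
# "Latin1" rather than UTF-8 (this value is a guess -- adjust to your files)
if (file.exists("baladas.txt")) {
  txt <- read_csv("baladas.txt", col_names = FALSE, show_col_types = FALSE,
                  locale = locale(encoding = "Latin1"))
  # list the rows readr could not parse cleanly
  print(problems(txt))
}
```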

8 Tokenization

Store the text in a structured format.

Token: the unit of analysis.

In basic tokenization, each token is a word.

One-token-per-line format.

By default, punctuation is removed and the text is normalized to lowercase (accents are not removed by default).
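A rough base-R approximation of what `unnest_tokens()` does by default (the sentence is made up):

```r
s <- "¡Quiero VOLAR, contigo!"

# drop punctuation (including the inverted marks), lowercase, split on
# whitespace; note that accented letters are left untouched
tokens <- strsplit(tolower(gsub("[[:punct:]¡¿]", "", s)), "\\s+")[[1]]
tokens  # "quiero" "volar" "contigo"
```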

suppressMessages(suppressWarnings(library(tidytext)))
suppressMessages(suppressWarnings(library(magrittr)))
##### tidy-format tokenization
# ---------- Baladas ----------
text_Baladas %<>%
  unnest_tokens(input = text, output = word) %>%
  filter(!is.na(word))  # important!
class(text_Baladas)
## [1] "tbl_df"     "tbl"        "data.frame"
dim(text_Baladas)
## [1] 6071    2
head(text_Baladas, n = 10)
## # A tibble: 10 × 2
##     line word    
##    <int> <chr>   
##  1     1 quiero  
##  2     1 volar   
##  3     1 contigo 
##  4     2 muy     
##  5     2 alto    
##  6     2 en      
##  7     2 algún   
##  8     2 lugar   
##  9     3 quisiera
## 10     3 estar
# ---------- Reggaeton ----------
text_Reggaeton %<>%
  unnest_tokens(input = text, output = word) %>%
  filter(!is.na(word))
dim(text_Reggaeton)
## [1] 4432    2
head(text_Reggaeton, n = 10)
## # A tibble: 10 × 2
##     line word    
##    <int> <chr>   
##  1     1 el      
##  2     1 chisme  
##  3     1 remix   
##  4     2 ayo     
##  5     3 the     
##  6     3 official
##  7     3 remix   
##  8     3 baby    
##  9     4 me      
## 10     4 duele
# -----------Rock_canciones-------
text_Rock_canciones%<>%
    unnest_tokens(input = text, output = word) %>%
    filter(!is.na(word))
dim(text_Rock_canciones)
## [1] 2844    2
head(text_Rock_canciones, n = 10)
## # A tibble: 10 × 2
##     line word   
##    <int> <chr>  
##  1     1 los    
##  2     1 de     
##  3     1 adentro
##  4     1 nubes  
##  5     1 negras 
##  6     2 ti     
##  7     2 movería
##  8     2 cielo  
##  9     2 y      
## 10     2 tierra
# -----------Salsa_canciones-------
text_Salsa_canciones%<>%
    unnest_tokens(input = text, output = word) %>%
    filter(!is.na(word))
dim(text_Salsa_canciones)
## [1] 5472    2
head(text_Salsa_canciones, n = 10)
## # A tibble: 10 × 2
##     line word     
##    <int> <chr>    
##  1     1 a        
##  2     1 joe      
##  3     1 arrollo  
##  4     2 canciones
##  5     3 i        
##  6     3 rebelion 
##  7     4 quiero   
##  8     4 contarle 
##  9     4 mi       
## 10     4 hermano

9 Text normalization

Remove: tokens that contain numbers, stop words, and accent marks.
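The removal steps of this section can be sketched in base R on a toy token vector (the tokens are illustrative):

```r
tokens <- c("quiero", "am0r", "29", "algún", "corazón")

# 1) drop tokens containing digits
tokens <- tokens[!grepl("[0-9]", tokens)]

# 2) strip accents: chartr() maps each character of 'old' to the
#    character at the same position in 'new'
tokens <- chartr(old = "áéíóú", new = "aeiou", tokens)
tokens  # "quiero" "algun" "corazon"
```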

##### tokens containing numbers?
# ---------- Baladas ----------
text_Baladas %>%
  filter(grepl(pattern = '[0-9]', x = word)) %>% 
  count(word, sort = TRUE)
## # A tibble: 1 × 2
##   word      n
##   <chr> <int>
## 1 29        1
# ---------- Reggaeton ----------
text_Reggaeton %>%
  filter(grepl(pattern = '[0-9]', x = word)) %>% 
  count(word, sort = TRUE)
## # A tibble: 2 × 2
##   word      n
##   <chr> <int>
## 1 6         5
## 2 440       1
# ------------Rock_canciones-----------
text_Rock_canciones %>%
    filter(grepl(pattern = '[0-9]', x = word)) %>%
    count(word, sort = TRUE)
## # A tibble: 0 × 2
## # ℹ 2 variables: word <chr>, n <int>
# ------------Salsa_canciones-----------
text_Salsa_canciones %>%
    filter(grepl(pattern = '[0-9]', x = word)) %>%
    count(word, sort = TRUE)
## # A tibble: 0 × 2
## # ℹ 2 variables: word <chr>, n <int>
##### remove tokens containing numbers
# ---------- Baladas ----------
text_Baladas %<>%
  filter(!grepl(pattern = '[0-9]', x = word))
dim(text_Baladas)
## [1] 6070    2
# ---------- Reggaeton ----------
text_Reggaeton %<>%
  filter(!grepl(pattern = '[0-9]', x = word))
dim(text_Reggaeton)
## [1] 4426    2
# -----------Rock_canciones-----
text_Rock_canciones %<>%
  filter(!grepl(pattern = '[0-9]', x = word))
dim(text_Rock_canciones)
## [1] 2844    2
# -----------Salsa_canciones-----
text_Salsa_canciones %<>%
  filter(!grepl(pattern = '[0-9]', x = word))
dim(text_Salsa_canciones)
## [1] 5472    2
##### stop words 
# 3 English lexicons (onix, SMART, snowball) included by default in tidytext
data(stop_words)
class(stop_words)
## [1] "tbl_df"     "tbl"        "data.frame"
dim(stop_words)
## [1] 1149    2
head(stop_words, n = 10)
## # A tibble: 10 × 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART
table(stop_words$lexicon)
## 
##     onix    SMART snowball 
##      404      571      174
###### stop words 
# no Spanish lexicons are bundled with tidytext
# COUNTWORDSFREE Spanish lexicon (with accents)
# http://countwordsfree.com/stopwords/spanish
# other alternatives:
#   https://github.com/stopwords-iso/stopwords-es
#   tm::stopwords("spanish")
# the same format as the tidytext lexicons is kept
stop_words_es <- tibble(word = unlist(c(read.table("Stopwords.txt", quote="\"", comment.char=""))), lexicon = "custom")
dim(stop_words_es)
## [1] 102   2
head(stop_words_es, n = 10)
## # A tibble: 10 × 2
##    word  lexicon
##    <chr> <chr>  
##  1 La    custom 
##  2 lo    custom 
##  3 las   custom 
##  4 un    custom 
##  5 una   custom 
##  6 de    custom 
##  7 en    custom 
##  8 con   custom 
##  9 por   custom 
## 10 para  custom
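The `anti_join()` used next keeps the rows of `x` that have no match in `y`; on plain vectors, the stop-word filter amounts to (toy data):

```r
words <- c("quiero", "la", "vida", "en", "contigo")
stops <- c("la", "en", "de", "un")

# keep only the words that are not stop words
kept <- words[!words %in% stops]
kept  # "quiero" "vida" "contigo"
```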
##### remove stop words
# ---------- Baladas ----------
text_Baladas %<>% 
  anti_join(x = ., y = stop_words_es)
## Joining with `by = join_by(word)`
dim(text_Baladas)
## [1] 2720    2
head(text_Baladas, n = 10)
## # A tibble: 10 × 2
##     line word    
##    <int> <chr>   
##  1     1 quiero  
##  2     1 volar   
##  3     1 contigo 
##  4     2 muy     
##  5     2 alto    
##  6     2 algún   
##  7     2 lugar   
##  8     3 quisiera
##  9     3 contigo 
## 10     4 viendo
# ---------- Reggaeton ----------
text_Reggaeton %<>% 
  anti_join(x = ., y = stop_words_es)
## Joining with `by = join_by(word)`
dim(text_Reggaeton)
## [1] 2047    2
head(text_Reggaeton, n = 10)
## # A tibble: 10 × 2
##     line word     
##    <int> <chr>    
##  1     1 chisme   
##  2     2 ayo      
##  3     3 the      
##  4     3 official 
##  5     3 baby     
##  6     4 duele    
##  7     4 haberte  
##  8     4 entregado
##  9     4 amor     
## 10     4 puro
#-----------Rock_canciones------
text_Rock_canciones %<>%
    anti_join(x = . , y = stop_words_es)
## Joining with `by = join_by(word)`
dim(text_Rock_canciones)
## [1] 1489    2
head(text_Rock_canciones, n = 10)
## # A tibble: 10 × 2
##     line word   
##    <int> <chr>  
##  1     1 adentro
##  2     1 nubes  
##  3     1 negras 
##  4     2 movería
##  5     2 cielo  
##  6     2 tierra 
##  7     2 pudiera
##  8     3 cuanto 
##  9     3 daría  
## 10     3 tenerte
#-----------Salsa_canciones------
text_Salsa_canciones %<>%
    anti_join(x = . , y = stop_words_es)
## Joining with `by = join_by(word)`
dim(text_Salsa_canciones)
## [1] 2657    2
head(text_Salsa_canciones, n = 10)
## # A tibble: 10 × 2
##     line word     
##    <int> <chr>    
##  1     1 joe      
##  2     1 arrollo  
##  3     2 canciones
##  4     3 i        
##  5     3 rebelion 
##  6     4 quiero   
##  7     4 contarle 
##  8     4 hermano  
##  9     5 pedacito 
## 10     5 historia
##### remove accents
replacement_list <- list('á' = 'a', 'é' = 'e', 'í' = 'i', 'ó' = 'o', 'ú' = 'u')
# ---------- Baladas ----------
text_Baladas %<>% 
  mutate(word = chartr(old = names(replacement_list) %>% str_c(collapse = ''), 
                       new = replacement_list %>% str_c(collapse = ''),
                       x = word))
dim(text_Baladas)
## [1] 2720    2
head(text_Baladas, n = 10)
## # A tibble: 10 × 2
##     line word    
##    <int> <chr>   
##  1     1 quiero  
##  2     1 volar   
##  3     1 contigo 
##  4     2 muy     
##  5     2 alto    
##  6     2 algun   
##  7     2 lugar   
##  8     3 quisiera
##  9     3 contigo 
## 10     4 viendo
# ---------- Reggaeton ----------
text_Reggaeton %<>% 
  mutate(word = chartr(old = names(replacement_list) %>% str_c(collapse = ''), 
                       new = replacement_list %>% str_c(collapse = ''),
                       x = word))
dim(text_Reggaeton)
## [1] 2047    2
head(text_Reggaeton, n = 10)
## # A tibble: 10 × 2
##     line word     
##    <int> <chr>    
##  1     1 chisme   
##  2     2 ayo      
##  3     3 the      
##  4     3 official 
##  5     3 baby     
##  6     4 duele    
##  7     4 haberte  
##  8     4 entregado
##  9     4 amor     
## 10     4 puro
#-----------------Rock_canciones-------------
text_Rock_canciones %<>%
  mutate(word = chartr(old = names(replacement_list)%>% str_c(collapse = ''),
                       new = replacement_list %>% str_c(collapse = ''),
                       x = word))
dim(text_Rock_canciones)
## [1] 1489    2
head(text_Rock_canciones, n= 10)
## # A tibble: 10 × 2
##     line word   
##    <int> <chr>  
##  1     1 adentro
##  2     1 nubes  
##  3     1 negras 
##  4     2 moveria
##  5     2 cielo  
##  6     2 tierra 
##  7     2 pudiera
##  8     3 cuanto 
##  9     3 daria  
## 10     3 tenerte
#-----------------Salsa_canciones-------------
text_Salsa_canciones %<>%
  mutate(word = chartr(old = names(replacement_list)%>% str_c(collapse = ''),
                       new = replacement_list %>% str_c(collapse = ''),
                       x = word))
dim(text_Salsa_canciones)
## [1] 2657    2
head(text_Salsa_canciones, n= 10)
## # A tibble: 10 × 2
##     line word     
##    <int> <chr>    
##  1     1 joe      
##  2     1 arrollo  
##  3     2 canciones
##  4     3 i        
##  5     3 rebelion 
##  6     4 quiero   
##  7     4 contarle 
##  8     4 hermano  
##  9     5 pedacito 
## 10     5 historia

10 Most frequent tokens

##### top 10 most frequent tokens
# ---------- Baladas ----------
text_Baladas%>% 
  count(word, sort = TRUE) %>%
  head(n = 10)
## # A tibble: 10 × 2
##    word        n
##    <chr>   <int>
##  1 quiero     51
##  2 solo       36
##  3 vida       36
##  4 contigo    29
##  5 amor       23
##  6 besos      23
##  7 otra       23
##  8 se         21
##  9 como       20
## 10 mujer      19
# ---------- Reggaeton ----------
text_Reggaeton%>% 
  count(word, sort = TRUE)  %>%
  head(n = 10)
## # A tibble: 10 × 2
##    word         n
##    <chr>    <int>
##  1 quiero      30
##  2 cama        19
##  3 ganas       19
##  4 recuerdo    18
##  5 santa       18
##  6 encanta     17
##  7 amor        16
##  8 baby        15
##  9 solo        15
## 10 dale        14
#------------Rock_canciones-------------
text_Rock_canciones%>%
  count(word, sort = TRUE) %>%
  head(n = 10)
## # A tibble: 10 × 2
##    word        n
##    <chr>   <int>
##  1 solo       33
##  2 quiero     32
##  3 amor       26
##  4 afuera     16
##  5 tuyas      14
##  6 corazon    13
##  7 negras     13
##  8 nubes      13
##  9 que        13
## 10 adentro    12
#------------Salsa_canciones-------------
text_Salsa_canciones%>%
  count(word, sort = TRUE) %>%
  head(n = 10)
## # A tibble: 10 × 2
##    word       n
##    <chr>  <int>
##  1 amor      53
##  2 son       52
##  3 mas       46
##  4 quiero    41
##  5 salsa     30
##  6 quedo     27
##  7 vida      27
##  8 mundo     25
##  9 negra     24
## 10 jamas     23
##### viz
suppressMessages(suppressWarnings(library(gridExtra)))
# ---------- Baladas ----------
text_Baladas %>%
  count(word, sort = TRUE) %>%
  filter(n > 14) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
    theme_light() + 
    geom_col(fill = 'red4', alpha = 0.8) +
    xlab(NULL) +
    ylab("Frecuencia") +
    coord_flip() +
    ggtitle(label = 'Baladas: Conteo de palabras') -> p1
# ---------- Reggaeton ----------
text_Reggaeton %>%
  count(word, sort = TRUE) %>%
  filter(n > 10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
    theme_light() + 
    geom_col(fill = 'blue4', alpha = 0.8) +
    xlab(NULL) +
    ylab("Frecuencia") +
    coord_flip() +
    ggtitle(label = 'Reggaeton: Conteo de palabras') -> p2
#-----------Rock_canciones--------
text_Rock_canciones %>%
  count(word, sort = TRUE) %>%
  filter(n > 10) %>%
  mutate(word = reorder(word,n)) %>%
  ggplot(aes(x = word, y = n)) +
    theme_light()+
    geom_col(fill= 'purple4', alpha = 0.8)+
    xlab(NULL)+
    ylab("Frecuencia")+
    coord_flip()+
    ggtitle(label = 'Rock: Conteo de palabras') -> p3
#-----------Salsa_canciones--------
text_Salsa_canciones %>%
  count(word, sort = TRUE) %>%
  filter(n > 15) %>%
  mutate(word = reorder(word,n)) %>%
  ggplot(aes(x = word, y = n)) +
    theme_light()+
    geom_col(fill= 'yellow4', alpha = 0.8)+
    xlab(NULL)+
    ylab("Frecuencia")+
    coord_flip()+
    ggtitle(label = 'Salsa: Conteo de palabras') -> p4
# display the plots
grid.arrange(p1, p2, p3, p4)

suppressMessages(suppressWarnings(library(wordcloud)))
###### viz
par(mfrow = c(2,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# ---------- Baladas ----------
set.seed(123)
text_Baladas %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(words = word, freq = n, max.words = 12, colors = 'red4'))
title(main = "Baladas: Nube de Palabras")
# ---------- Reggaeton ----------
set.seed(123)
text_Reggaeton %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(words = word, freq = n, max.words = 10, colors = 'blue4'))
## Warning in wordcloud(words = word, freq = n, max.words = 10, colors = "blue4"):
## quiero could not be fit on page. It will not be plotted.
title(main = "Reggaeton: Nube de Palabras")
#-------------Rock_canciones-----------
set.seed(123)
text_Rock_canciones %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(words = word, freq = n, max.words = 10, colors = 'purple4'))
title(main = "Rock: Nube de Palabras")
#-------------Salsa_canciones-----------
set.seed(1234)
text_Salsa_canciones %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(words = word, freq = n, max.words = 10, colors = 'yellow4'))
## Warning in wordcloud(words = word, freq = n, max.words = 10, colors =
## "yellow4"): quiero could not be fit on page. It will not be plotted.
title(main = "Salsa: Nube de Palabras")

##### relative word frequencies
bind_rows(mutate(.data = text_Baladas, author = "Baladas"),
                       mutate(.data = text_Reggaeton, author = "Reggaeton"),
                        mutate(.data = text_Rock_canciones, author = "Rock_canciones"),
                          mutate(.data = text_Salsa_canciones, author = "Salsa_canciones")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n/sum(n)) %>%
  select(-n) %>%
  spread(author, proportion, fill = 0) -> frec  # important!
frec %<>% 
  select(word, Baladas, Reggaeton, Rock_canciones, Salsa_canciones)
dim(frec)
## [1] 2502    5
head(frec, n = 10)
## # A tibble: 10 × 5
##    word         Baladas Reggaeton Rock_canciones Salsa_canciones
##    <chr>          <dbl>     <dbl>          <dbl>           <dbl>
##  1 abandonados 0         0              0               0.000376
##  2 abandones   0         0              0.000672        0       
##  3 abatidas    0         0              0               0.000376
##  4 abeja       0         0              0.00134         0       
##  5 abrazame    0         0              0               0.000376
##  6 abrazarte   0.000368  0.000489       0               0       
##  7 abrazo      0.000368  0              0               0.000376
##  8 abriendo    0         0              0.000672        0       
##  9 abrigarte   0         0.000489       0               0       
## 10 aburrida    0.000368  0              0               0
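`tidyr::spread()` used above is superseded; `pivot_wider()` is its current equivalent. A sketch on a toy long table shaped like the `count(author, word)` result (the data are made up):

```r
library(tidyr)

# toy long table: one row per (author, word) pair
long <- data.frame(author = c("A", "A", "B"),
                   word = c("amor", "vida", "amor"),
                   proportion = c(0.5, 0.5, 1.0))

# one row per word, one column per author, 0 where a word is absent
wide <- pivot_wider(long, names_from = author,
                    values_from = proportion, values_fill = 0)
wide
```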
##### top 10 shared words
# nested ordering by Baladas, Reggaeton, Rock, and Salsa
frec %>%
  filter(Baladas !=0, Reggaeton != 0, Rock_canciones != 0) %>%
  arrange(desc(Baladas), desc(Reggaeton), desc(Rock_canciones), desc(Salsa_canciones)) -> frec_comun
dim(frec_comun)
## [1] 100   5
head(frec_comun, n = 10)
## # A tibble: 10 × 5
##    word    Baladas Reggaeton Rock_canciones Salsa_canciones
##    <chr>     <dbl>     <dbl>          <dbl>           <dbl>
##  1 quiero  0.0188    0.0147        0.0215          0.0154  
##  2 solo    0.0132    0.00733       0.0222          0.00715 
##  3 vida    0.0132    0.00489       0.00269         0.0102  
##  4 contigo 0.0107    0.00489       0.00201         0.00151 
##  5 amor    0.00846   0.00782       0.0175          0.0199  
##  6 besos   0.00846   0.00244       0.00134         0.000753
##  7 mujer   0.00699   0.00195       0.00134         0.000753
##  8 nada    0.00662   0.00440       0.00403         0.00188 
##  9 siempre 0.00625   0.00293       0.00269         0.00414 
## 10 nunca   0.00588   0.00244       0.000672        0.00226
###### proportion of shared words
dim(frec_comun)[1]/dim(frec)[1]
## [1] 0.03996803
##### correlation of the frequencies
# beware of the test's assumptions
# Bootstrap is a possible alternative
# pairwise correlations between genres over the shared words
cor.test(x = frec_comun$Baladas, y = frec_comun$Reggaeton)
cor.test(x = frec_comun$Baladas, y = frec_comun$Rock_canciones)
cor.test(x = frec_comun$Baladas, y = frec_comun$Salsa_canciones)
cor.test(x = frec_comun$Reggaeton, y = frec_comun$Rock_canciones)
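The Bootstrap alternative mentioned in the comments can be sketched in base R; here `x` and `y` are simulated stand-ins, but in the document they would be two columns of `frec_comun`:

```r
set.seed(42)
x <- runif(100)
y <- 0.7 * x + 0.3 * runif(100)  # simulated stand-ins for two genre columns

# percentile Bootstrap: resample pairs, recompute the correlation each time
B <- 2000
boot_cor <- replicate(B, {
  idx <- sample(seq_along(x), replace = TRUE)
  cor(x[idx], y[idx])
})
quantile(boot_cor, c(0.025, 0.975))  # 95% percentile confidence interval
```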

11 Sentiment analysis

Each word (single token, or unigram) is assigned a score (a scale, positive/negative, or an emotion).

The sentiment of a text is defined as the sum of the scores of its individual words.

Dictionaries:

Objectives:

Caveats:
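The word-score definition above can be sketched with a made-up mini lexicon; words absent from the lexicon simply contribute nothing:

```r
# toy lexicon: +1 positive, -1 negative (all values invented)
lexicon <- c(amor = 1, feliz = 1, triste = -1, dolor = -1)

tokens <- c("amor", "dolor", "amor", "cielo")  # "cielo" is not in the lexicon
scores <- lexicon[tokens]                      # NA for unknown words

sum(scores, na.rm = TRUE)  # total sentiment = 1 + (-1) + 1 = 1
```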

##### sentiments 
# 3 English lexicons (AFINN, Bing, NRC) included by default in tidytext
# AFINN: Finn Arup Nielsen, scale from -5 to 5.
#   http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010
# Bing: Bing Liu and collaborators, binary classification (+/-).
#   https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
# NRC: Saif Mohammad and Peter Turney, binary classification (+/-) plus some emotion categories.
#   http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
# lexicons
# no Spanish lexicons are bundled with tidytext
# https://www.kaggle.com/datasets/rtatman/sentiment-lexicons-for-81-languages
positive_words <- read_csv("Positive_words.txt", col_names = "word", show_col_types = FALSE) %>%
  mutate(sentiment = "Positivo")
negative_words <- read_csv("Negativewords.txt", col_names = "word", show_col_types = FALSE) %>%
  mutate(sentiment = "Negativo")
sentiment_words <- bind_rows(positive_words, negative_words)
# lexicon comparison
get_sentiments("bing") %>%
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005
sentiment_words %>%
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 Negativo     95
## 2 Positivo     71
###### viz
suppressMessages(suppressWarnings(library(RColorBrewer)))
# ---------- Baladas ----------
text_Baladas %>%
  inner_join(sentiment_words) %>%
  count(word, sentiment, sort = TRUE) %>%
  filter(n > 2) %>%
  mutate(n = ifelse(sentiment == "Negativo", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
    geom_col() +
    scale_fill_manual(values = brewer.pal(8,'Dark2')[c(2,5)]) +
    coord_flip(ylim = c(-7,7)) +
    labs(y = "Frecuencia",
         x = NULL,
         title = "Baladas") +
    theme_minimal() +
    theme(legend.position = "none") -> p1
## Joining with `by = join_by(word)`
## Warning in inner_join(., sentiment_words): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 263 of `x` matches multiple rows in `y`.
## ℹ Row 2 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
# ---------- Reggaeton ----------
text_Reggaeton %>%
  inner_join(sentiment_words) %>%
  count(word, sentiment, sort = TRUE) %>%
  filter(n > 2) %>%
  mutate(n = ifelse(sentiment == "Negativo", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
    geom_col() +
    scale_fill_manual(values = brewer.pal(8,'Dark2')[c(2,5)]) +
    coord_flip(ylim = c(-7,7)) +
    labs(y = "Frecuencia",
         x = NULL,
         title = "Reggaeton") +
    theme_minimal() +
    theme(legend.position = "none") -> p2 
## Joining with `by = join_by(word)`
## Warning in inner_join(., sentiment_words): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1504 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
#-------------Rock_canciones------------------
text_Rock_canciones %>%
    inner_join(sentiment_words) %>%
    count(word, sentiment, sort = TRUE) %>%
    filter(n > 2) %>%
    mutate(n = ifelse(sentiment == "Negativo", -n, n)) %>%
    mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
    geom_col() +
    scale_fill_manual(values = brewer.pal(8,'Dark2')[c(2,5)]) +
    coord_flip(ylim = c(-7,7)) +
    labs(y = "Frecuencia",
         x = NULL,
         title = "Rock") +
    theme_minimal() +
    theme(legend.position = "none") -> p3
## Joining with `by = join_by(word)`
## Warning in inner_join(., sentiment_words): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 31 of `x` matches multiple rows in `y`.
## ℹ Row 72 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
#-------------Salsa_canciones------------------
text_Salsa_canciones %>%
    inner_join(sentiment_words) %>%
    count(word, sentiment, sort = TRUE) %>%
    filter(n > 2) %>%
    mutate(n = ifelse(sentiment == "Negativo", -n, n)) %>%
    mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
    geom_col() +
    scale_fill_manual(values = brewer.pal(8,'Dark2')[c(2,5)]) +
    coord_flip(ylim = c(-7,7)) +
    labs(y = "Frecuencia",
         x = NULL,
         title = "Salsa",
         fill = "Sentimiento"
         ) +
    theme_minimal() -> p4
## Joining with `by = join_by(word)`
## Warning in inner_join(., sentiment_words): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435 of `x` matches multiple rows in `y`.
## ℹ Row 71 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
# display the plots
grid.arrange(p1, p2, p3, p4, ncol  = 4)

suppressMessages(suppressWarnings(library(reshape2)))  # acast

##### viz
par(mfrow = c(2,2), mar = c(1,1,3,1), mgp = c(1,1,1))
# ---------- Baladas ----------
set.seed(123)
text_Baladas %>%
  inner_join(sentiment_words) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(
    colors = c("darkred", "darkgreen"),
    title.size = 0.01,
    title.colors = c("white", "white"),
    family = "serif",
    scale = c(3,1),
    max.words = 50 
    )
## Joining with `by = join_by(word)`
## Warning in inner_join(., sentiment_words): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 263 of `x` matches multiple rows in `y`.
## ℹ Row 2 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
title(main = "Baladas: NB Sentimiento")
# ---------- Reggaeton ----------
set.seed(123)
text_Reggaeton %>%
  inner_join(sentiment_words) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(
    colors = c("red", "green"),
    title.size = 1.5,
    family = "serif",
    scale = c(3,1),
    max.words = 50 
  )
## Joining with `by = join_by(word)`
## Warning in inner_join(., sentiment_words): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1504 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
title(main = "Reggaeton: NB Sentimiento")
#--------Rock_canciones---------
set.seed(123)
text_Rock_canciones %>%
  inner_join(sentiment_words) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(
    colors = c("red", "green"),
    title.size = 1.5,
    family = "serif",
    scale = c(3,1),
    max.words = 50 
  )
## Joining with `by = join_by(word)`
## Warning in inner_join(., sentiment_words): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 31 of `x` matches multiple rows in `y`.
## ℹ Row 72 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
title(main = "Rock: NB Sentimiento")
#--------Salsa_canciones---------
set.seed(123)
text_Salsa_canciones %>%
  inner_join(sentiment_words) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(
    colors = c("darkred", "darkgreen"),
    title.size = 1.5,
    family = "serif",
    scale = c(3,1),
    max.words = 50 
  )
## Joining with `by = join_by(word)`
## Warning in inner_join(., sentiment_words): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435 of `x` matches multiple rows in `y`.
## ℹ Row 71 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
title(main = "Salsa: NB Sentimiento")

12 Bigrams

unnest_tokens has been used so far to tokenize the lyrics into individual words.

Now we want to tokenize into sequences of words.

Which words tend to follow others? Which words tend to co-occur?
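The idea behind bigram tokenization can be sketched by hand before calling unnest_tokens: a bigram is simply each word paired with its successor. A minimal base-R sketch:

```r
# Build bigrams by hand: pair each word with the word that follows it
line_text <- "préstame tu peine"
words     <- strsplit(tolower(line_text), "\\s+")[[1]]
bigrams   <- paste(head(words, -1), tail(words, -1))
bigrams
# "préstame tu" "tu peine"
```

unnest_tokens with token = "ngrams", n = 2 does this (plus lowercasing and punctuation stripping) for every row of the data frame.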

##### bigrams: example song from Rock_canciones
# text
text <- c("Préstame tu peine",
          "Y péiname el alma",
          "Desenrédame",
          "Fuera de este mundo",
          "Dime que no estoy",
          "Soñándote",
          "Enséñame",
          "De qué estamos hechos",
          "Que quiero orbitar planetas",
          "Hasta ver uno vació",
          "Que quiero irme a vivir",
          "Pero que sea contigo",
          "Viento",
          "Amárranos",
          "Tiempo",
          "Detente muchos años",
          "Viento",
          "Amárranos",
          "Tiempo",
          "Detente muchos años")

# convert to a data frame
text_df <- tibble(line = 1:length(text), text = text)
# tokenize into bigrams
text_df %>% 
  unnest_tokens(tbl = ., input = text, output = bigram, token = "ngrams", n = 2) %>%
  head(n = 10)
## # A tibble: 10 × 2
##     line bigram     
##    <int> <chr>      
##  1     1 préstame tu
##  2     1 tu peine   
##  3     2 y péiname  
##  4     2 péiname el 
##  5     2 el alma    
##  6     3 <NA>       
##  7     4 fuera de   
##  8     4 de este    
##  9     4 este mundo 
## 10     5 dime que
#### bigrams: example song from Baladas
# text
text <- c("Que si pudiera darle vueltas a la Tierra una y otra vez",
          "Yo buscaría de alguien con tus mismos ojos, con tus mismos labios",
          "Con tu misma boca y con tu misma piel",
          "Que si pudiera darle al tiempo otro poco de tiempo",
          "Para comprender que sin ti, mi vida ya no la siento",
          "Que el color se vuelve a blanco y negro",
          "Y sé que la distancia me hizo ciego",
          "En todos los momentos",
          "Los que tenía que verte aquí, Que si pudiera darle vueltas a la Tierra una y otra vez",
          "Yo buscaría de alguien con tus mismos ojos, con tus mismos labios",
          "Con tu misma boca y con tu misma piel",
          "Que si pudiera darle al tiempo otro poco de tiempo",
          "Para comprender que sin ti, mi vida ya no la siento",
          "Que el color se vuelve a blanco y negro",
          "Y sé que la distancia me hizo ciego",
          "En todos los momentos",
          "Los que tenía que verte aquí")

# convert to a data frame
text_df <- tibble(line = 1:length(text), text = text)
# tokenize into bigrams
text_df %>% 
  unnest_tokens(tbl = ., input = text, output = bigram, token = "ngrams", n = 2) %>%
  head(n = 10)
## # A tibble: 10 × 2
##     line bigram       
##    <int> <chr>        
##  1     1 que si       
##  2     1 si pudiera   
##  3     1 pudiera darle
##  4     1 darle vueltas
##  5     1 vueltas a    
##  6     1 a la         
##  7     1 la tierra    
##  8     1 tierra una   
##  9     1 una y        
## 10     1 y otra
##### bigrams: Reggaeton example
# text
text <- c("No hay que sufrir, no hay que llorar",
          "La vida es una y es un carnaval",
          "Lo malo se irá, todo pasará",
          "La vida es una y es un carnaval",
          "La vida es una y es un carnaval",
          "La vida es una y es un carnaval",
          "Seré tu ángel guardián",
          "Tu mejor compañía",
          "Toma fuerte mi mano",
          "Te enseñaré a volar",
          "Ya no habrá mal de amores",
          "Vendrán tiempos mejores",
          "Levanta ya tu mano que vinimos a gozar")

# convert to a data frame
text_df <- tibble(line = 1:length(text), text = text)
# tokenize into bigrams
text_df %>% 
  unnest_tokens(tbl = ., input = text, output = bigram, token = "ngrams", n = 2) %>%
  head(n = 10)
## # A tibble: 10 × 2
##     line bigram    
##    <int> <chr>     
##  1     1 no hay    
##  2     1 hay que   
##  3     1 que sufrir
##  4     1 sufrir no 
##  5     1 no hay    
##  6     1 hay que   
##  7     1 que llorar
##  8     2 la vida   
##  9     2 vida es   
## 10     2 es una
#### bigrams: Salsa example
# text
text <- c("Me siento en el techo y empiezo a ordenar para ti",
          "Los besos que no pude dibujar",
          "Sale el sol, me acaricia",
          "Nace tu canción",
          "La gente como tú, enciende mi ser",
          "Perdóname si molesto, estas cosas que hago yo",
          "Ya ves, ya he olvidado o si son verdes o son azul",
          "Y te diré, tus ojos viven en mi",
          "Y son los más tiernos de la mayoría")

# convert to a data frame
text_df <- tibble(line = 1:length(text), text = text)
# tokenize into bigrams
text_df %>% 
  unnest_tokens(tbl = ., input = text, output = bigram, token = "ngrams", n = 2) %>%
  head(n = 10)
## # A tibble: 10 × 2
##     line bigram      
##    <int> <chr>       
##  1     1 me siento   
##  2     1 siento en   
##  3     1 en el       
##  4     1 el techo    
##  5     1 techo y     
##  6     1 y empiezo   
##  7     1 empiezo a   
##  8     1 a ordenar   
##  9     1 ordenar para
## 10     1 para ti
##### import data
text_Rock_canciones <- unlist(c(read_csv("Rock_canciones.txt", col_names = FALSE, show_col_types = FALSE)))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
names(text_Rock_canciones) <- NULL
text_Rock_canciones <- tibble(line = 1:length(text_Rock_canciones), text = text_Rock_canciones)
##### import data
text_Baladas <- unlist(c(read_csv("Baladas.txt", col_names = FALSE, show_col_types = FALSE)))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
names(text_Baladas) <- NULL
text_Baladas<- tibble(line = 1:length(text_Baladas), text = text_Baladas)
##### import data
text_Reggaeton <- unlist(c(read_csv("Reggaeton_proyecto.txt", col_names = FALSE, show_col_types = FALSE)))
names(text_Reggaeton) <- NULL
text_Reggaeton <- tibble(line = 1:length(text_Reggaeton), text = text_Reggaeton)
##### import data
text_Salsa_canciones <- unlist(c(read_csv("Salsa_canciones.txt", col_names = FALSE, show_col_types = FALSE)))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
names(text_Salsa_canciones) <- NULL
text_Salsa_canciones <- tibble(line = 1:length(text_Salsa_canciones), text = text_Salsa_canciones)
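The parsing warnings above appear because read_csv treats commas inside the lyrics as field separators. A hedged alternative (assuming one verse per line in the .txt files; the name `read_lyrics` is just illustrative) is readr::read_lines, which reads each line verbatim:

```r
library(readr)
library(tibble)

# Illustrative helper: read a lyrics file line by line, with no CSV parsing
read_lyrics <- function(path) {
  lines <- read_lines(path)
  tibble(line = seq_along(lines), text = lines)
}
# e.g. text_Rock_canciones <- read_lyrics("Rock_canciones.txt")
```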
##### tokenize into bigrams
# each token is now a bigram
text_Rock_canciones %>%
  unnest_tokens(tbl = ., input = text, output = bigram, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) -> text_Rock_canciones_bi  # important!
dim(text_Rock_canciones_bi)
## [1] 2210    2
head(text_Rock_canciones_bi, n = 10)
## # A tibble: 10 × 2
##     line bigram       
##    <int> <chr>        
##  1     1 los de       
##  2     1 de adentro   
##  3     1 adentro nubes
##  4     1 nubes negras 
##  5     2 ti movería   
##  6     2 movería cielo
##  7     2 cielo y      
##  8     2 y tierra     
##  9     2 tierra si    
## 10     2 si pudiera
##### tokenize into bigrams
# each token is now a bigram
text_Baladas %>%
  unnest_tokens(tbl = ., input = text, output = bigram, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) -> text_Baladas_bi  # important!
dim(text_Baladas_bi)
## [1] 5114    2
head(text_Baladas_bi, n = 10)
## # A tibble: 10 × 2
##     line bigram        
##    <int> <chr>         
##  1     1 quiero volar  
##  2     1 volar contigo 
##  3     2 muy alto      
##  4     2 alto en       
##  5     2 en algún      
##  6     2 algún lugar   
##  7     3 quisiera estar
##  8     3 estar contigo 
##  9     4 viendo las    
## 10     4 las estrellas
##### tokenize into bigrams
# each token is now a bigram
text_Reggaeton %>%
  unnest_tokens(tbl = ., input = text, output = bigram, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) -> text_Reggaeton_bi  # important!
dim(text_Reggaeton_bi)
## [1] 3693    2
head(text_Reggaeton_bi, n = 10)
## # A tibble: 10 × 2
##     line bigram           
##    <int> <chr>            
##  1     1 el chisme        
##  2     1 chisme remix     
##  3     3 the official     
##  4     3 official remix   
##  5     3 remix baby       
##  6     4 me duele         
##  7     4 duele haberte    
##  8     4 haberte entregado
##  9     4 entregado un     
## 10     4 un amor
##### tokenize into bigrams
# each token is now a bigram
text_Salsa_canciones %>%
  unnest_tokens(tbl = ., input = text, output = bigram, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) -> text_Salsa_canciones_bi  # important!
dim(text_Salsa_canciones_bi)
## [1] 4517    2
head(text_Salsa_canciones_bi, n = 10)
## # A tibble: 10 × 2
##     line bigram         
##    <int> <chr>          
##  1     1 a joe          
##  2     1 joe arrollo    
##  3     3 i rebelion     
##  4     4 quiero contarle
##  5     4 contarle mi    
##  6     4 mi hermano     
##  7     5 un pedacito    
##  8     5 pedacito de    
##  9     5 de la          
## 10     5 la historia
###### top 10 most frequent bigrams
# some bigrams are uninteresting (e.g., "de la")
# this again motivates removing stop words
text_Rock_canciones_bi %>%
  count(bigram, sort = TRUE) %>%
  head(n = 10)
## # A tibble: 10 × 2
##    bigram           n
##    <chr>        <int>
##  1 de mi           16
##  2 lo que          15
##  3 hoy quiero      14
##  4 oh oh           14
##  5 nubes negras    13
##  6 el album        12
##  7 mi cabeza       12
##  8 mi corazón      12
##  9 album de        11
## 10 cabeza sólo     11
###### top 10 most frequent bigrams
# some bigrams are uninteresting (e.g., "de la")
# this again motivates removing stop words
text_Baladas_bi %>%
  count(bigram, sort = TRUE) %>%
  head(n = 10)
## # A tibble: 10 × 2
##    bigram      n
##    <chr>   <int>
##  1 oh oh     120
##  2 mmh mmh    49
##  3 de ti      34
##  4 no se      33
##  5 que no     31
##  6 que te     28
##  7 en la      24
##  8 mi vida    20
##  9 oh ooh     20
## 10 no es      19
###### top 10 most frequent bigrams
# some bigrams are uninteresting (e.g., "de la")
# this again motivates removing stop words
text_Reggaeton_bi %>%
  count(bigram, sort = TRUE) %>%
  head(n = 10)
## # A tibble: 10 × 2
##    bigram       n
##    <chr>    <int>
##  1 en mi       30
##  2 no te       27
##  3 lo que      25
##  4 que no      25
##  5 no no       24
##  6 que te      23
##  7 que me      22
##  8 se que      20
##  9 la santa    18
## 10 mi cama     18
###### top 10 most frequent bigrams
# some bigrams are uninteresting (e.g., "de la")
# this again motivates removing stop words
text_Salsa_canciones_bi %>%
  count(bigram, sort = TRUE) %>%
  head(n = 10)
## # A tibble: 10 × 2
##    bigram       n
##    <chr>    <int>
##  1 no no       36
##  2 me quedo    27
##  3 a la        26
##  4 el mundo    24
##  5 que te      24
##  6 la vida     23
##  7 que no      22
##  8 le pegue    21
##  9 no le       21
## 10 pegue a     21
##### remove stop words
text_Rock_canciones_bi %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!grepl(pattern = '[0-9]', x = word1)) %>%
  filter(!grepl(pattern = '[0-9]', x = word2)) %>%
  filter(!word1 %in% stop_words_es$word) %>%
  filter(!word2 %in% stop_words_es$word) %>%
  mutate(word1 = chartr(old = names(replacement_list) %>% str_c(collapse = ''), 
                       new = replacement_list %>% str_c(collapse = ''),
                       x = word1)) %>%
  mutate(word2 = chartr(old = names(replacement_list) %>% str_c(collapse = ''), 
                       new = replacement_list %>% str_c(collapse = ''),
                       x = word2)) %>%
  filter(!is.na(word1)) %>% 
  filter(!is.na(word2)) %>%
  count(word1, word2, sort = TRUE) %>%
  rename(weight = n) -> text_Rock_canciones_bi_counts  # needed later to build the network!
dim(text_Rock_canciones_bi_counts)
## [1] 213   3
head(text_Rock_canciones_bi_counts, n = 10)
## # A tibble: 10 × 3
##    word1     word2    weight
##    <chr>     <chr>     <int>
##  1 nubes     negras       13
##  2 cabeza    solo         11
##  3 fotos     tuyas        11
##  4 florecita rockera       9
##  5 solo      adentro       9
##  6 quiero    entender      8
##  7 solo      quiero        8
##  8 diablo    amor          7
##  9 muchos    años          6
## 10 negras    sobre         6
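The chartr step above relies on replacement_list to map accented characters one by one. An alternative sketch (assumes the stringi package) gets the same kind of normalization with a single transliteration rule; note that Latin-ASCII also maps ñ to n, which may or may not be desired here:

```r
library(stringi)

palabras <- c("corazón", "años", "café")
# Latin-ASCII transliteration strips accents in one pass
stri_trans_general(palabras, "Latin-ASCII")
# "corazon" "anos" "cafe"
```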
##### remove stop words
text_Baladas_bi %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!grepl(pattern = '[0-9]', x = word1)) %>%
  filter(!grepl(pattern = '[0-9]', x = word2)) %>%
  filter(!word1 %in% stop_words_es$word) %>%
  filter(!word2 %in% stop_words_es$word) %>%
  mutate(word1 = chartr(old = names(replacement_list) %>% str_c(collapse = ''), 
                       new = replacement_list %>% str_c(collapse = ''),
                       x = word1)) %>%
  mutate(word2 = chartr(old = names(replacement_list) %>% str_c(collapse = ''), 
                       new = replacement_list %>% str_c(collapse = ''),
                       x = word2)) %>%
  filter(!is.na(word1)) %>% 
  filter(!is.na(word2)) %>%
  count(word1, word2, sort = TRUE) %>%
  rename(weight = n) -> text_Baladas_bi_counts  # needed later to build the network!
dim(text_Baladas_bi_counts)
## [1] 396   3
head(text_Baladas_bi_counts, n = 10)
## # A tibble: 10 × 3
##    word1    word2    weight
##    <chr>    <chr>     <int>
##  1 solo     quiero       16
##  2 besos    matan        12
##  3 donde    vamos         7
##  4 nadie    ve            7
##  5 primer   millon        7
##  6 bota     fuego         6
##  7 matan    morire        6
##  8 perderme contigo       6
##  9 pudiera  darle         6
## 10 puede    prohibir      6
##### remove stop words
text_Reggaeton_bi %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!grepl(pattern = '[0-9]', x = word1)) %>%
  filter(!grepl(pattern = '[0-9]', x = word2)) %>%
  filter(!word1 %in% stop_words_es$word) %>%
  filter(!word2 %in% stop_words_es$word) %>%
  mutate(word1 = chartr(old = names(replacement_list) %>% str_c(collapse = ''), 
                       new = replacement_list %>% str_c(collapse = ''),
                       x = word1)) %>%
  mutate(word2 = chartr(old = names(replacement_list) %>% str_c(collapse = ''), 
                       new = replacement_list %>% str_c(collapse = ''),
                       x = word2)) %>%
  filter(!is.na(word1)) %>% 
  filter(!is.na(word2)) %>%
  count(word1, word2, sort = TRUE) %>%
  rename(weight = n) -> text_Reggaeton_bi_counts  # needed later to build the network!
dim(text_Reggaeton_bi_counts)
## [1] 376   3
head(text_Reggaeton_bi_counts, n = 10)
## # A tibble: 10 × 3
##    word1     word2     weight
##    <chr>     <chr>      <int>
##  1 sigue     bailando       8
##  2 bailando  mami           6
##  3 cogi      anoche         6
##  4 levanta   baby           6
##  5 misma     hora           6
##  6 necesita  reggaeton      6
##  7 pantalon  dale           6
##  8 quiero    tenerte        6
##  9 reggaeton dale           6
## 10 rozamo    algo           6
##### remove stop words
text_Salsa_canciones_bi %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!grepl(pattern = '[0-9]', x = word1)) %>%
  filter(!grepl(pattern = '[0-9]', x = word2)) %>%
  filter(!word1 %in% stop_words_es$word) %>%
  filter(!word2 %in% stop_words_es$word) %>%
  mutate(word1 = chartr(old = names(replacement_list) %>% str_c(collapse = ''), 
                       new = replacement_list %>% str_c(collapse = ''),
                       x = word1)) %>%
  mutate(word2 = chartr(old = names(replacement_list) %>% str_c(collapse = ''), 
                       new = replacement_list %>% str_c(collapse = ''),
                       x = word2)) %>%
  filter(!is.na(word1)) %>% 
  filter(!is.na(word2)) %>%
  count(word1, word2, sort = TRUE) %>%
  rename(weight = n) -> text_Salsa_canciones_bi_counts  # needed later to build the network!
dim(text_Salsa_canciones_bi_counts)
## [1] 470   3
head(text_Salsa_canciones_bi_counts, n = 10)
## # A tibble: 10 × 3
##    word1      word2     weight
##    <chr>      <chr>      <int>
##  1 quiero     mas           17
##  2 negado     amor          12
##  3 vida       dura          12
##  4 jamas      jamas         11
##  5 mas        bonita         9
##  6 otro       pasito         8
##  7 mas        ni             7
##  8 cachondeas vagabundo      6
##  9 ese        men            6
## 10 mundo      quiere         6
##### define a network from the bigram frequencies (weight)
# binary, undirected, weighted, simple
# varying the filter threshold and building non-consecutive bigrams is recommended to obtain more informative networks
suppressMessages(suppressWarnings(library(igraph)))
g <- text_Rock_canciones_bi_counts %>%
  filter(weight > 2) %>%
  graph_from_data_frame(directed = FALSE)
# viz
set.seed(123)
plot(g, layout = layout_with_fr, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label.color = 'black', vertex.label.cex = 1, vertex.label.dist = 1, main = "Bigramas Rock con Umbral = 3")

##### define a network from the bigram frequencies (weight)
# binary, undirected, weighted, simple
# varying the filter threshold and building non-consecutive bigrams is recommended to obtain more informative networks
suppressMessages(suppressWarnings(library(igraph)))
g <- text_Baladas_bi_counts %>%
  filter(weight > 2) %>%
  graph_from_data_frame(directed = FALSE)
# viz
set.seed(123)
plot(g, layout = layout_with_fr, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label.color = 'purple', vertex.label.cex = 1, vertex.label.dist = 1, main = "Bigramas Baladas con Umbral = 3")

##### define a network from the bigram frequencies (weight)
# binary, undirected, weighted, simple
# varying the filter threshold and building non-consecutive bigrams is recommended to obtain more informative networks
suppressMessages(suppressWarnings(library(igraph)))
g <- text_Reggaeton_bi_counts %>%
  filter(weight > 2) %>%
  graph_from_data_frame(directed = FALSE)
# viz
set.seed(123)
plot(g, layout = layout_with_fr, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label.color = 'maroon', vertex.label.cex = 1, vertex.label.dist = 1, main = "Bigramas Reggaeton con Umbral = 3")

##### define a network from the bigram frequencies (weight)
# binary, undirected, weighted, simple
# varying the filter threshold and building non-consecutive bigrams is recommended to obtain more informative networks
suppressMessages(suppressWarnings(library(igraph)))
g <- text_Salsa_canciones_bi_counts %>%
  filter(weight > 2) %>%
  graph_from_data_frame(directed = FALSE)
# viz
set.seed(123)
plot(g, layout = layout_with_fr, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label.color = 'navyblue', vertex.label.cex = 1, vertex.label.dist = 1, main = "Bigramas Salsa con Umbral = 3")

##### network with a different threshold
g <- text_Rock_canciones_bi_counts %>%
  filter(weight > 0) %>%
  graph_from_data_frame(directed = FALSE)
# viz
set.seed(123)
plot(g, layout = layout_with_kk, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label = NA, main = "Umbral = 1")

##### network with a different threshold
g <- text_Baladas_bi_counts %>%
  filter(weight > 0) %>%
  graph_from_data_frame(directed = FALSE)
# viz
set.seed(123)
plot(g, layout = layout_with_kk, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label = NA, main = "Umbral = 1")

##### network with a different threshold
g <- text_Reggaeton_bi_counts %>%
  filter(weight > 0) %>%
  graph_from_data_frame(directed = FALSE)
# viz
set.seed(123)
plot(g, layout = layout_with_kk, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label = NA, main = "Umbral = 1")

##### network with a different threshold
g <- text_Salsa_canciones_bi_counts %>%
  filter(weight > 0) %>%
  graph_from_data_frame(directed = FALSE)
# viz
set.seed(123)
plot(g, layout = layout_with_kk, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label = NA, main = "Umbral = 1")
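Rather than fixing the threshold by hand for each genre, one data-driven heuristic (a sketch, not what was done above) is to derive it from the weight distribution, e.g. keeping only bigrams above the 75th percentile:

```r
# Toy weight vector standing in for a weight column such as
# text_Rock_canciones_bi_counts$weight
weights   <- c(13, 11, 11, 9, 9, 8, 8, 7, 6, 6, rep(1, 90))
threshold <- quantile(weights, probs = 0.75)
strong    <- weights[weights > threshold]
length(strong)  # 10 bigrams survive in this toy example
```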

##### largest connected component of the network
g <- text_Rock_canciones_bi_counts %>%
  filter(weight > 0) %>%
  graph_from_data_frame(directed = FALSE)
# subgraph induced by the largest connected component
V(g)$cluster <- components(graph = g)$membership
gcc <- induced_subgraph(graph = g, vids = which(V(g)$cluster == which.max(components(graph = g)$csize)))
par(mfrow = c(1,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# viz 1
set.seed(123)
plot(gcc, layout = layout_with_kk, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label.color = 'black', vertex.label.cex = 0.9, vertex.label.dist = 1)
# viz 2
set.seed(123)
plot(gcc, layout = layout_with_kk, vertex.color = adjustcolor('red4', 0.1), vertex.frame.color = 'red4', vertex.size = 2*strength(gcc), vertex.label.color = 'black', vertex.label.cex = 0.9, vertex.label.dist = 1, edge.width = 3*E(g)$weight/max(E(g)$weight))
title(main = "Componente conexa Rock Canciones", outer = T, line = -1)
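Beyond plotting, the giant component can be summarized numerically with igraph. A minimal sketch on a toy graph (a stand-in for gcc, since the real one depends on the data files):

```r
suppressMessages(suppressWarnings(library(igraph)))

# Toy undirected weighted graph with the same shape as the bigram networks
edges <- data.frame(
  from   = c("nubes",  "cabeza", "solo",    "solo"),
  to     = c("negras", "solo",   "adentro", "quiero"),
  weight = c(13, 11, 9, 8)
)
g_toy <- graph_from_data_frame(edges, directed = FALSE)

vcount(g_toy)                    # 6 words (vertices)
ecount(g_toy)                    # 4 bigram links (edges)
degree(g_toy, v = "solo")        # 3 unweighted connections of "solo"
strength(g_toy, vids = "solo")   # 28 = 11 + 9 + 8, weighted degree
```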

##### largest connected component of the network
g <- text_Baladas_bi_counts %>%
  filter(weight > 0) %>%
  graph_from_data_frame(directed = FALSE)
# subgraph induced by the largest connected component
V(g)$cluster <- components(graph = g)$membership
gcc <- induced_subgraph(graph = g, vids = which(V(g)$cluster == which.max(components(graph = g)$csize)))
par(mfrow = c(1,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# viz 1
set.seed(123)
plot(gcc, layout = layout_with_kk, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label.color = 'black', vertex.label.cex = 0.6, vertex.label.dist = 2)
# viz 2
set.seed(123)
plot(gcc, layout = layout_with_kk, vertex.color = adjustcolor('green4', 0.1), vertex.frame.color = 'green4', vertex.size = 2*strength(gcc), vertex.label.color = 'black', vertex.label.cex = 0.9, vertex.label.dist = 1, edge.width = 3*E(g)$weight/max(E(g)$weight))
title(main = "Componente conexa Baladas", outer = T, line = -1)

##### largest connected component of the network
g <- text_Reggaeton_bi_counts %>%
  filter(weight > 0) %>%
  graph_from_data_frame(directed = FALSE)
# subgraph induced by the largest connected component
V(g)$cluster <- components(graph = g)$membership
gcc <- induced_subgraph(graph = g, vids = which(V(g)$cluster == which.max(components(graph = g)$csize)))
par(mfrow = c(1,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# viz 1
set.seed(123)
plot(gcc, layout = layout_with_kk, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label.color = 'black', vertex.label.cex = 0.6, vertex.label.dist = 2)
# viz 2
set.seed(123)
plot(gcc, layout = layout_with_kk, vertex.color = adjustcolor('blue4', 0.1), vertex.frame.color = 'blue4', vertex.size = 2*strength(gcc), vertex.label.color = 'black', vertex.label.cex = 0.9, vertex.label.dist = 1, edge.width = 3*E(g)$weight/max(E(g)$weight))
title(main = "Componente conexa Reggaeton", outer = T, line = -1)

##### largest connected component of the network
g <- text_Salsa_canciones_bi_counts %>%
  filter(weight > 0) %>%
  graph_from_data_frame(directed = FALSE)
# subgraph induced by the largest connected component
V(g)$cluster <- components(graph = g)$membership
gcc <- induced_subgraph(graph = g, vids = which(V(g)$cluster == which.max(components(graph = g)$csize)))
par(mfrow = c(1,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# viz 1
set.seed(123)
plot(gcc, layout = layout_with_kk, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label.color = 'black', vertex.label.cex = 0.6, vertex.label.dist = 2)
# viz 2
set.seed(123)
plot(gcc, layout = layout_with_kk, vertex.color = adjustcolor('purple4', 0.1), vertex.frame.color = 'purple4', vertex.size = 2*strength(gcc), vertex.label.color = 'black', vertex.label.cex = 0.9, vertex.label.dist = 1, edge.width = 3*E(g)$weight/max(E(g)$weight))
title(main = "Componente conexa Salsa", outer = T, line = -1)

13 Skip-grams

##### skip-gram: example song "Viento" (Caifanes)
# text
text <- c("Préstame tu peine",
          "Y péiname el alma",
          "Desenrédame",
          "Fuera de este mundo",
          "Dime que no estoy",
          "Soñándote",
          "Enséñame",
          "De qué estamos hechos",
          "Que quiero orbitar planetas",
          "Hasta ver uno vació",
          "Que quiero irme a vivir",
          "Pero que sea contigo",
          "Viento",
          "Amárranos",
          "Tiempo",
          "Detente muchos años",
          "Viento",
          "Amárranos",
          "Tiempo",
          "Detente muchos años")
# convert to a data frame
text_df <- tibble(line = 1:length(text), text = text)
# tokenize into skip-grams
text_df %>% 
  unnest_tokens(tbl = ., input = text, output = skipgram, token = "skip_ngrams", n = 2) %>%
  head(n = 10)
## # A tibble: 10 × 2
##     line skipgram      
##    <int> <chr>         
##  1     1 préstame      
##  2     1 préstame tu   
##  3     1 préstame peine
##  4     1 tu            
##  5     1 tu peine      
##  6     1 peine         
##  7     2 y             
##  8     2 y péiname     
##  9     2 y el          
## 10     2 péiname
##### skip-gram: example song "Tabaco y Chanel"
# text
text <- c("Un olor a tabaco y Chanel",
"Me recuerda el olor de su piel",
"Una mezcla de miel y café",
"Me recuerda el sabor de sus besos",
"El color del final de la noche",
"Me pregunta dónde fui a parar, dónde estás",
"Que esto solo se vive una vez",
"Dónde fuiste a parar, dónde estás",
"Un olor a tabaco y Chanel",
"Y una mezcla de miel y café",
"Me preguntan por ella (ella) Me",
"preguntan por ella",
"Me preguntan también las estrellas",
"Me reclaman que vuelva por ella",
"Ay, que vuelva por ella (ella)",
"Ay, que vuelva por ella")
# convert to a data frame
text_df <- tibble(line  = 1:length(text), text = text)
# tokenize into skip-grams
text_df %>% 
  unnest_tokens(tbl = ., input = text, output = skipgram, token = "skip_ngrams", n = 2) %>%
  head(n = 10)
## # A tibble: 10 × 2
##     line skipgram   
##    <int> <chr>      
##  1     1 un         
##  2     1 un olor    
##  3     1 un a       
##  4     1 olor       
##  5     1 olor a     
##  6     1 olor tabaco
##  7     1 a          
##  8     1 a tabaco   
##  9     1 a y        
## 10     1 tabaco
##### skip-gram: example song "Safari"
# text
text <- c("Oye, papi, vamos con mis amigas para el party",
"Tengo algo por un animal",
"Cuando mi gente está aquí, hay tsunami",
"Wavy, así es lo que me gusta",
"You know I like it when tú fresco",
"Me llamo princesa",
"Voy a coger provecho",
"Lo que me gusta",
"You know I like it when tú fresco",
"Me llamo princesa",
"Voy a coger provecho")
# convert to a data frame
text_df <- tibble(line  = 1:length(text), text = text)
# tokenize into skip-grams
text_df %>% 
  unnest_tokens(tbl = ., input = text, output = skipgram, token = "skip_ngrams", n = 2) %>%
  head(n = 10)
## # A tibble: 10 × 2
##     line skipgram  
##    <int> <chr>     
##  1     1 oye       
##  2     1 oye papi  
##  3     1 oye vamos 
##  4     1 papi      
##  5     1 papi vamos
##  6     1 papi con  
##  7     1 vamos     
##  8     1 vamos con 
##  9     1 vamos mis 
## 10     1 con
##### skip-gram: example Salsa song (Yuri Buenaventura)
# text
text <- c("La salsa que aquí les traigo",
"la traigo directo mira",
"la traigo de las entrañas",
"de mi américa latina",

"El día que estes llorando",
"y tu alma se encuentre triste",
"si bailas salsa mi hermano",
"olvidarás que lo fuiste")
# convert to a data frame
text_df <- tibble(line  = 1:length(text), text = text)
# tokenize into skip-grams
text_df %>% 
  unnest_tokens(tbl = ., input = text, output = skipgram, token = "skip_ngrams", n = 2) %>%
  head(n = 10)
## # A tibble: 10 × 2
##     line skipgram  
##    <int> <chr>     
##  1     1 la        
##  2     1 la salsa  
##  3     1 la que    
##  4     1 salsa     
##  5     1 salsa que 
##  6     1 salsa aquí
##  7     1 que       
##  8     1 que aquí  
##  9     1 que les   
## 10     1 aquí
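The skip_ngrams tokenizer also takes a k argument (how many words may be skipped between the two tokens); the examples above rely on its defaults. A sketch making both parameters explicit (assuming tidytext forwards n and k to tokenizers::tokenize_skip_ngrams):

```r
library(tibble)
library(tidytext)

text_df <- tibble(line = 1, text = "la salsa que aquí les traigo")
# n = 2: up to bigrams; k = 1: allow one skipped word between the two
skips <- unnest_tokens(text_df, output = skipgram, input = text,
                       token = "skip_ngrams", n = 2, k = 1)
head(skips, n = 6)
```

With k = 1, "la que" appears even though "salsa" sits between the two words, matching the output shown above.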
##### import Rock data
text_Rock_canciones <- unlist(c(read_csv("Rock_canciones.txt", col_names = FALSE, show_col_types = FALSE)))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
names(text_Rock_canciones) <- NULL
text_Rock_canciones <- tibble(line = 1:length(text_Rock_canciones), text = text_Rock_canciones)

##### import Baladas data
text_baladas <- unlist(c(read_csv("Baladas.txt", col_names = FALSE, show_col_types = FALSE)))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
names(text_baladas) <- NULL
text_baladas <- tibble(line = 1:length(text_baladas), text = text_baladas)

##### import Reggaeton data
text_reggaeton <- unlist(c(read_csv("Reggaeton_proyecto.txt", col_names = FALSE, show_col_types = FALSE)))
names(text_reggaeton) <- NULL
text_reggaeton <- tibble(line = 1:length(text_reggaeton), text = text_reggaeton)

##### import Salsa data
text_salsa <- unlist(c(read_csv("Salsa_canciones.txt", col_names = FALSE, show_col_types = FALSE)))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
names(text_salsa) <- NULL
text_salsa <- tibble(line = 1:length(text_salsa), text = text_salsa)
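The parsing warnings above come from `read_csv()` treating commas inside the lyrics as field separators. Since each file is just one line of text per row, reading raw lines sidesteps the problem entirely. A sketch (the helper name `read_lyrics` is ours):

```r
library(tibble)

# Read a plain-text lyrics file as one row per line, with no CSV parsing.
read_lyrics <- function(path) {
  x <- readLines(path, encoding = "UTF-8", warn = FALSE)
  tibble(line = seq_along(x), text = x)
}

# Toy demonstration with a temporary file:
tmp <- tempfile(fileext = ".txt")
writeLines(c("la salsa que aquí les traigo", "la traigo directo, mira"), tmp)
read_lyrics(tmp)
```

The comma in the second line stays inside the text instead of splitting it into two columns.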
##### tokenize into skip-grams
# here each token is a unigram, a regular bigram, or a bigram with a one-word skip
# Rock
text_Rock_canciones %>%
  unnest_tokens(tbl = ., input = text, output = skipgram, token = "skip_ngrams", n = 2) %>%
  filter(!is.na(skipgram)) -> text_Rock_canciones_skip
dim(text_Rock_canciones_skip)
## [1] 6684    2
head(text_Rock_canciones_skip, n = 10)
## # A tibble: 10 × 2
##     line skipgram      
##    <int> <chr>         
##  1     1 los           
##  2     1 los de        
##  3     1 los adentro   
##  4     1 de            
##  5     1 de adentro    
##  6     1 de nubes      
##  7     1 adentro       
##  8     1 adentro nubes 
##  9     1 adentro negras
## 10     1 nubes
# Baladas
text_baladas %>%
  unnest_tokens(tbl = ., input = text, output = skipgram, token = "skip_ngrams", n = 2) %>%
  filter(!is.na(skipgram)) -> text_baladas_skip
dim(text_baladas_skip)
## [1] 15342     2
head(text_baladas_skip, n = 10)
## # A tibble: 10 × 2
##     line skipgram      
##    <int> <chr>         
##  1     1 quiero        
##  2     1 quiero volar  
##  3     1 quiero contigo
##  4     1 volar         
##  5     1 volar contigo 
##  6     1 contigo       
##  7     2 muy           
##  8     2 muy alto      
##  9     2 muy en        
## 10     2 alto
# Reggaeton
text_reggaeton %>%
  unnest_tokens(tbl = ., input = text, output = skipgram, token = "skip_ngrams", n = 2) %>%
  filter(!is.na(skipgram)) -> text_reggaeton_skip
dim(text_reggaeton_skip)
## [1] 11100     2
head(text_reggaeton_skip, n = 10)
## # A tibble: 10 × 2
##     line skipgram    
##    <int> <chr>       
##  1     1 el          
##  2     1 el chisme   
##  3     1 el remix    
##  4     1 chisme      
##  5     1 chisme remix
##  6     1 remix       
##  7     2 ayo         
##  8     3 the         
##  9     3 the official
## 10     3 the remix
# Salsa
text_salsa %>%
  unnest_tokens(tbl = ., input = text, output = skipgram, token = "skip_ngrams", n = 2) %>%
  filter(!is.na(skipgram)) -> text_salsa_skip
dim(text_salsa_skip)
## [1] 13573     2
head(text_salsa_skip, n = 10)
## # A tibble: 10 × 2
##     line skipgram   
##    <int> <chr>      
##  1     1 a          
##  2     1 a joe      
##  3     1 a arrollo  
##  4     1 joe        
##  5     1 joe arrollo
##  6     1 arrollo    
##  7     2 canciones  
##  8     3 i          
##  9     3 i rebelion 
## 10     3 rebelion
##### remove unigrams
suppressMessages(suppressWarnings(library(ngram)))

# 1) Rock
# count the words in each skip-gram
text_Rock_canciones_skip$num_words <- text_Rock_canciones_skip$skipgram %>% 
  map_int(.f = ~ wordcount(.x))
head(text_Rock_canciones_skip, n = 10)
## # A tibble: 10 × 3
##     line skipgram       num_words
##    <int> <chr>              <int>
##  1     1 los                    1
##  2     1 los de                 2
##  3     1 los adentro            2
##  4     1 de                     1
##  5     1 de adentro             2
##  6     1 de nubes               2
##  7     1 adentro                1
##  8     1 adentro nubes          2
##  9     1 adentro negras         2
## 10     1 nubes                  1
# remove unigrams (keep only two-word skip-grams)
text_Rock_canciones_skip %<>% 
  filter(num_words == 2) %>% 
  select(-num_words)
dim(text_Rock_canciones_skip)
## [1] 3840    2
head(text_Rock_canciones_skip, n = 10)
## # A tibble: 10 × 2
##     line skipgram      
##    <int> <chr>         
##  1     1 los de        
##  2     1 los adentro   
##  3     1 de adentro    
##  4     1 de nubes      
##  5     1 adentro nubes 
##  6     1 adentro negras
##  7     1 nubes negras  
##  8     2 ti movería    
##  9     2 ti cielo      
## 10     2 movería cielo
# 2) Baladas

# count the words in each skip-gram
text_baladas_skip$num_words <- text_baladas_skip$skipgram %>% 
  map_int(.f = ~ wordcount(.x))
head(text_baladas_skip, n = 10)
## # A tibble: 10 × 3
##     line skipgram       num_words
##    <int> <chr>              <int>
##  1     1 quiero                 1
##  2     1 quiero volar           2
##  3     1 quiero contigo         2
##  4     1 volar                  1
##  5     1 volar contigo          2
##  6     1 contigo                1
##  7     2 muy                    1
##  8     2 muy alto               2
##  9     2 muy en                 2
## 10     2 alto                   1
# remove unigrams (keep only two-word skip-grams)
text_baladas_skip %<>% 
  filter(num_words == 2) %>% 
  select(-num_words)

dim(text_baladas_skip)
## [1] 9271    2
head(text_baladas_skip, n = 10)
## # A tibble: 10 × 2
##     line skipgram      
##    <int> <chr>         
##  1     1 quiero volar  
##  2     1 quiero contigo
##  3     1 volar contigo 
##  4     2 muy alto      
##  5     2 muy en        
##  6     2 alto en       
##  7     2 alto algún    
##  8     2 en algún      
##  9     2 en lugar      
## 10     2 algún lugar
# 3) Reggaeton

# count the words in each skip-gram
text_reggaeton_skip$num_words <- text_reggaeton_skip$skipgram %>% 
  map_int(.f = ~ wordcount(.x))
head(text_reggaeton_skip, n = 10)
## # A tibble: 10 × 3
##     line skipgram     num_words
##    <int> <chr>            <int>
##  1     1 el                   1
##  2     1 el chisme            2
##  3     1 el remix             2
##  4     1 chisme               1
##  5     1 chisme remix         2
##  6     1 remix                1
##  7     2 ayo                  1
##  8     3 the                  1
##  9     3 the official         2
## 10     3 the remix            2
# remove unigrams (keep only two-word skip-grams)
text_reggaeton_skip %<>% 
  filter(num_words == 2) %>% 
  select(-num_words)

dim(text_reggaeton_skip)
## [1] 6668    2
head(text_reggaeton_skip, n = 10)
## # A tibble: 10 × 2
##     line skipgram      
##    <int> <chr>         
##  1     1 el chisme     
##  2     1 el remix      
##  3     1 chisme remix  
##  4     3 the official  
##  5     3 the remix     
##  6     3 official remix
##  7     3 official baby 
##  8     3 remix baby    
##  9     4 me duele      
## 10     4 me haberte
# 4) Salsa

# count the words in each skip-gram
text_salsa_skip$num_words <- text_salsa_skip$skipgram %>% 
  map_int(.f = ~ wordcount(.x))
head(text_salsa_skip, n = 10)
## # A tibble: 10 × 3
##     line skipgram    num_words
##    <int> <chr>           <int>
##  1     1 a                   1
##  2     1 a joe               2
##  3     1 a arrollo           2
##  4     1 joe                 1
##  5     1 joe arrollo         2
##  6     1 arrollo             1
##  7     2 canciones           1
##  8     3 i                   1
##  9     3 i rebelion          2
## 10     3 rebelion            1
# remove unigrams (keep only two-word skip-grams)
text_salsa_skip %<>% 
  filter(num_words == 2) %>% 
  select(-num_words)

dim(text_salsa_skip)
## [1] 8101    2
head(text_salsa_skip, n = 10)
## # A tibble: 10 × 2
##     line skipgram        
##    <int> <chr>           
##  1     1 a joe           
##  2     1 a arrollo       
##  3     1 joe arrollo     
##  4     3 i rebelion      
##  5     4 quiero contarle 
##  6     4 quiero mi       
##  7     4 contarle mi     
##  8     4 contarle hermano
##  9     4 mi hermano      
## 10     5 un pedacito
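The `ngram::wordcount()` step used above can also be done without an extra package: for space-separated skip-grams, the word count is simply the number of spaces plus one. A sketch:

```r
library(stringr)

# Number of words in each space-separated skip-gram:
# count the spaces and add one.
tokens <- c("los", "los de", "adentro nubes")
num_words <- str_count(tokens, pattern = " ") + 1
num_words
# 1 2 2
```

Either approach yields the same `num_words` column used to filter out the unigrams.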
##### remove stop words

##### Rock
text_Rock_canciones_skip %>%
  separate(skipgram, c("word1", "word2"), sep = " ") %>%
  filter(!grepl(pattern = '[0-9]', x = word1)) %>%
  filter(!grepl(pattern = '[0-9]', x = word2)) %>%
  filter(!word1 %in% stop_words_es$word) %>%
  filter(!word2 %in% stop_words_es$word) %>%
  mutate(word1 = chartr(old = names(replacement_list) %>% str_c(collapse = ''), 
                       new = replacement_list %>% str_c(collapse = ''),
                       x = word1)) %>%
  mutate(word2 = chartr(old = names(replacement_list) %>% str_c(collapse = ''), 
                       new = replacement_list %>% str_c(collapse = ''),
                       x = word2)) %>%
  filter(!is.na(word1)) %>% 
  filter(!is.na(word2)) %>%
  count(word1, word2, sort = TRUE) %>%
  rename(weight = n) -> text_Rock_canciones_skip_counts
dim(text_Rock_canciones_skip_counts)
## [1] 455   3
head(text_Rock_canciones_skip_counts, n = 10)
## # A tibble: 10 × 3
##    word1      word2     weight
##    <chr>      <chr>      <int>
##  1 nubes      negras        13
##  2 cabeza     solo          11
##  3 fotos      tuyas         11
##  4 solo       fotos         11
##  5 tuyas      llena         11
##  6 florecita  rockera        9
##  7 solo       adentro        9
##  8 buscaste   despertar      8
##  9 despertar  pasion         8
## 10 encendiste hoguera        8
##### Baladas
text_baladas_skip %>%
  separate(skipgram, c("word1", "word2"), sep = " ") %>%
  filter(!grepl(pattern = '[0-9]', x = word1)) %>%
  filter(!grepl(pattern = '[0-9]', x = word2)) %>%
  filter(!word1 %in% stop_words_es$word) %>%
  filter(!word2 %in% stop_words_es$word) %>%
  mutate(word1 = chartr(
           old = names(replacement_list) %>% str_c(collapse = ''),
           new = replacement_list %>% str_c(collapse = ''),
           x = word1
         )) %>%
  mutate(word2 = chartr(
           old = names(replacement_list) %>% str_c(collapse = ''),
           new = replacement_list %>% str_c(collapse = ''),
           x = word2
         )) %>%
  filter(!is.na(word1)) %>%
  filter(!is.na(word2)) %>%
  count(word1, word2, sort = TRUE) %>%
  rename(weight = n) -> text_baladas_skip_counts

dim(text_baladas_skip_counts)
## [1] 854   3
head(text_baladas_skip_counts, n = 10)
## # A tibble: 10 × 3
##    word1   word2   weight
##    <chr>   <chr>    <int>
##  1 solo    quiero      16
##  2 atreves volver      12
##  3 besos   matan       12
##  4 como    atreves     12
##  5 quiero  contigo     11
##  6 olvida  nada         8
##  7 se      se           8
##  8 donde   vamos        7
##  9 nadie   ve           7
## 10 primer  millon       7
##### Reggaeton
text_reggaeton_skip %>%
  separate(skipgram, c("word1", "word2"), sep = " ") %>%
  filter(!grepl(pattern = '[0-9]', x = word1)) %>%
  filter(!grepl(pattern = '[0-9]', x = word2)) %>%
  filter(!word1 %in% stop_words_es$word) %>%
  filter(!word2 %in% stop_words_es$word) %>%
  mutate(word1 = chartr(
           old = names(replacement_list) %>% str_c(collapse = ''),
           new = replacement_list %>% str_c(collapse = ''),
           x = word1
         )) %>%
  mutate(word2 = chartr(
           old = names(replacement_list) %>% str_c(collapse = ''),
           new = replacement_list %>% str_c(collapse = ''),
           x = word2
         )) %>%
  filter(!is.na(word1)) %>%
  filter(!is.na(word2)) %>%
  count(word1, word2, sort = TRUE) %>%
  rename(weight = n) -> text_reggaeton_skip_counts

dim(text_reggaeton_skip_counts)
## [1] 741   3
head(text_reggaeton_skip_counts, n = 10)
## # A tibble: 10 × 3
##    word1    word2         weight
##    <chr>    <chr>          <int>
##  1 hacerte  amor              10
##  2 nena     tranquilicese      8
##  3 sigue    bailando           8
##  4 vida     mia                8
##  5 cuerpo   llama              7
##  6 bailando mami               6
##  7 clase    rumba              6
##  8 cogi     anoche             6
##  9 levanta  baby               6
## 10 mami     pare               6
##### Salsa
text_salsa_skip %>%
  separate(skipgram, c("word1", "word2"), sep = " ") %>%
  filter(!grepl(pattern = '[0-9]', x = word1)) %>%
  filter(!grepl(pattern = '[0-9]', x = word2)) %>%
  filter(!word1 %in% stop_words_es$word) %>%
  filter(!word2 %in% stop_words_es$word) %>%
  mutate(word1 = chartr(
           old = names(replacement_list) %>% str_c(collapse = ''),
           new = replacement_list %>% str_c(collapse = ''),
           x = word1
         )) %>%
  mutate(word2 = chartr(
           old = names(replacement_list) %>% str_c(collapse = ''),
           new = replacement_list %>% str_c(collapse = ''),
           x = word2
         )) %>%
  filter(!is.na(word1)) %>%
  filter(!is.na(word2)) %>%
  count(word1, word2, sort = TRUE) %>%
  rename(weight = n) -> text_salsa_skip_counts

dim(text_salsa_skip_counts)
## [1] 997   3
head(text_salsa_skip_counts, n = 10)
## # A tibble: 10 × 3
##    word1        word2  weight
##    <chr>        <chr>   <int>
##  1 quiero       mas        18
##  2 barranquilla quedo      13
##  3 escuches     canto      13
##  4 negado       amor       12
##  5 vida         dura       12
##  6 jamas        jamas      11
##  7 son          son        10
##  8 mas          bonita      9
##  9 ae           otro        8
## 10 aventura     mas         8
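One detail worth noting in the pipelines above: `chartr()` strips accents *after* the stop-word filter, so accented variants such as "más" are only removed if `stop_words_es` lists them with the accent (which would explain "mas" and "ni" surviving into the Salsa top 10). Normalizing before filtering catches both spellings. A minimal sketch with toy stand-ins for `replacement_list` and `stop_words_es` (the real objects are defined earlier in the document):

```r
library(tibble)
library(dplyr)
library(stringr)

# Toy stand-ins; assumptions, not the document's actual objects
replacement_list <- c('á' = 'a', 'é' = 'e', 'í' = 'i', 'ó' = 'o', 'ú' = 'u')
stop_words_es <- tibble(word = c("mas", "se", "ni"))

filtered <- tibble(word1 = c("más", "vida")) %>%
  # normalize accents first ...
  mutate(word1 = chartr(old = str_c(names(replacement_list), collapse = ''),
                        new = str_c(replacement_list, collapse = ''),
                        x = word1)) %>%
  # ... then filter: "más" -> "mas" is now caught by the stop list
  filter(!word1 %in% stop_words_es$word)
filtered
```

Only "vida" survives, whereas filtering first would have let "más" through.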
##### define a network from the bigram frequencies (weight)

##### Rock
g <- text_Rock_canciones_skip_counts %>%
  filter(weight > 0) %>%
  graph_from_data_frame(directed = FALSE)
g <- igraph::simplify(g)  # important: removes loops and multi-edges
# subgraph induced by the largest connected component
# (components() is the current igraph name for the deprecated clusters())
V(g)$cluster <- components(graph = g)$membership
gcc <- induced_subgraph(graph = g, vids = which(V(g)$cluster == which.max(components(graph = g)$csize)))
par(mfrow = c(1,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# viz 1
set.seed(123)
plot(gcc, layout = layout_with_fr, vertex.color = 1, vertex.frame.color = 1, vertex.size = 3, vertex.label = NA)
# viz 2
set.seed(123)
plot(gcc, layout = layout_with_fr, vertex.color = adjustcolor('red4', 0.1), vertex.frame.color = 'red4', vertex.size = 2*strength(gcc), vertex.label = NA)
title(main = "Connected component and clusters - Rock", outer = TRUE, line = -1)

##### Baladas
g <- text_baladas_skip_counts %>%
  filter(weight > 0) %>%
  graph_from_data_frame(directed = FALSE)
g <- igraph::simplify(g)

# Subgraph induced by the largest connected component
V(g)$cluster <- components(graph = g)$membership
gcc <- induced_subgraph(graph = g, vids = which(V(g)$cluster == which.max(components(graph = g)$csize)))

par(mfrow = c(1,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# Viz 1
set.seed(123)
plot(
  gcc,
  layout       = layout_with_fr,
  vertex.color = "skyblue",
  vertex.frame.color = "black",
  vertex.size  = 2,
  vertex.label = NA
)
# Viz 2
set.seed(123)
plot(
  gcc,
  layout       = layout_with_fr,
  vertex.color = adjustcolor('salmon', 0.1),
  vertex.frame.color = 'salmon',
  vertex.size  = 2*strength(gcc),
  vertex.label = NA
)
title(main = "Connected component and clusters - Baladas", outer = TRUE, line = -1)

##### Reggaeton
g <- text_reggaeton_skip_counts %>%
  filter(weight > 0) %>%
  graph_from_data_frame(directed = FALSE)
g <- igraph::simplify(g)

# Subgraph induced by the largest connected component
V(g)$cluster <- components(graph = g)$membership
gcc <- induced_subgraph(graph = g, vids = which(V(g)$cluster == which.max(components(graph = g)$csize)))

par(mfrow = c(1,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# Viz 1
set.seed(123)
plot(
  gcc,
  layout             = layout_with_fr,
  vertex.color       = "gold",
  vertex.frame.color = "gold",
  vertex.size        = 2,
  vertex.shape       = "square",
  vertex.label       = NA
)
# Viz 2
set.seed(123)
plot(
  gcc,
  layout       = layout_with_fr,
  vertex.color = adjustcolor('lightgreen', 0.1),
  vertex.frame.color = 'darkgreen',
  vertex.size  = 2*strength(gcc),
  vertex.label = NA
)
title(main = "Connected component and clusters - Reggaeton", outer = TRUE, line = -1)

##### Salsa
g <- text_salsa_skip_counts %>%
  filter(weight > 0) %>%
  graph_from_data_frame(directed = FALSE)
g <- igraph::simplify(g)

# Subgraph induced by the largest connected component
V(g)$cluster <- components(graph = g)$membership
gcc <- induced_subgraph(graph = g, vids = which(V(g)$cluster == which.max(components(graph = g)$csize)))

par(mfrow = c(1,2), mar = c(1,1,2,1), mgp = c(1,1,1))
# Viz 1
set.seed(123)
plot(
  gcc,
  layout             = layout_with_fr,
  vertex.color       = "firebrick4",
  vertex.frame.color = "firebrick3",
  vertex.size        = 3,
  vertex.shape       = "pie",
  vertex.label       = NA
)
# Viz 2
set.seed(123)
plot(
  gcc,
  layout       = layout_with_fr,
  vertex.color = adjustcolor('chocolate4', 0.1),
  vertex.frame.color = 'chocolate4',
  vertex.size  = 2*strength(gcc),
  vertex.label = NA
)
title(main = "Connected component and clusters - Salsa", outer = TRUE, line = -1)

# Comparison

Comparison across genres of Colombian music, based on skip-grams; for each genre we analyze the connected component of the network built with weight threshold 1.

13.1 Networks

Most important words

Baladas: Top 10

## Warning: The `scale` argument of `eigen_centrality()` is deprecated as of igraph 2.1.1.
## ℹ eigen_centrality() will always behave as if scale=TRUE were used.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## # A tibble: 10 × 2
##    word      eigen
##    <chr>     <dbl>
##  1 quiero    1    
##  2 solo      0.735
##  3 contigo   0.617
##  4 perderme  0.394
##  5 conmigo   0.206
##  6 otro      0.162
##  7 poco      0.146
##  8 encontrar 0.146
##  9 volar     0.132
## 10 decirle   0.128

Reggaeton: Top 10

## # A tibble: 10 × 2
##    word      eigen
##    <chr>     <dbl>
##  1 mami     1     
##  2 bailando 0.844 
##  3 sigue    0.844 
##  4 vida     0.430 
##  5 mia      0.430 
##  6 pare     0.399 
##  7 poderte  0.0289
##  8 suerte   0.0286
##  9 doy      0.0285
## 10 boom     0.0285

Rock: Top 10

## # A tibble: 10 × 2
##    word     eigen
##    <chr>    <dbl>
##  1 solo     1    
##  2 fotos    0.576
##  3 quiero   0.546
##  4 entender 0.492
##  5 cabeza   0.437
##  6 adentro  0.424
##  7 tuyas    0.318
##  8 nada     0.173
##  9 existe   0.170
## 10 cuido    0.170

Salsa: Top 10

## # A tibble: 10 × 2
##    word       eigen
##    <chr>      <dbl>
##  1 mas       1     
##  2 quiero    0.842 
##  3 bonita    0.352 
##  4 jamas     0.324 
##  5 aventura  0.303 
##  6 ni        0.286 
##  7 son       0.116 
##  8 hacer     0.107 
##  9 arte      0.107 
## 10 mostrarte 0.0970
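The tables above report eigenvector centrality. Assuming the scores come from `igraph::eigen_centrality()` applied to each genre's giant component (the computing code is not shown in this section), the calculation looks like the toy sketch below; note that, as the deprecation warning above says, recent igraph always scales so the top word gets score 1. In a triangle every vertex is symmetric and all three get the maximal score:

```r
suppressMessages(library(igraph))
suppressMessages(library(tibble))
suppressMessages(library(dplyr))

# Toy graph: a triangle on three of the Baladas words
g <- graph_from_data_frame(
  tibble(word1 = c("quiero", "quiero", "solo"),
         word2 = c("solo",   "contigo", "contigo")),
  directed = FALSE)

# Eigenvector centrality, scaled so the maximum is 1
eig <- eigen_centrality(g)$vector
top <- tibble(word = names(eig), eigen = eig) %>% arrange(desc(eigen))
top
```

On the real data, replacing the toy graph with a genre's `gcc` reproduces a ranking like the tables above.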

13.2 Clustering

##                    Baladas Reggaeton Rock_canciones  Salsa_canciones
## Tamaño partición        30        24             15               29
## Tamaño grupo menor       3         5              5                2
## Tamaño grupo mayor      53        32             26               53

Baladas: Top 5 of the largest group

## # A tibble: 5 × 3
##   word     cluster eigen
##   <chr>      <dbl> <dbl>
## 1 quiero         1 1    
## 2 solo           1 0.735
## 3 contigo        1 0.617
## 4 perderme       1 0.394
## 5 conmigo        1 0.206

Reggaeton: Top 5 of the largest group

## # A tibble: 5 × 3
##   word       cluster       eigen
##   <chr>        <dbl>       <dbl>
## 1 verdad           5 0.00000668 
## 2 siempre          5 0.00000179 
## 3 dime             5 0.000000448
## 4 escuchaste       5 0.000000442
## 5 plata            5 0.000000361

Rock: Top 5 of the largest group

## # A tibble: 5 × 3
##   word   cluster        eigen
##   <chr>    <dbl>        <dbl>
## 1 siento       4 0.000000488 
## 2 morir        4 0.0000000779
## 3 calor        4 0.0000000221
## 4 gran         4 0.0000000203
## 5 alguna       4 0.0000000195

Salsa: Top 5 of the largest group

## # A tibble: 5 × 3
##   word    cluster     eigen
##   <chr>     <dbl>     <dbl>
## 1 pobre         6 0.00888  
## 2 dios          6 0.00102  
## 3 viajero       6 0.000676 
## 4 llegan        6 0.000334 
## 5 triste        6 0.0000784
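The partition sizes reported above come from a community-detection step whose code is not shown here, and the method used is not stated. Purely as an illustration of the kind of computation involved, `igraph::cluster_fast_greedy()` (a modularity-based method; this is an assumption, not necessarily the authors' choice) partitions a toy graph of two triangles joined by a single bridge edge into two groups of three:

```r
suppressMessages(library(igraph))

# Two triangles joined by one bridge edge (c - d)
g <- graph_from_literal(a - b, b - c, a - c, d - e, e - f, d - f, c - d)

# Greedy modularity optimization
cl <- cluster_fast_greedy(g)
sizes(cl)
# two communities of size 3
```

On the real genre networks, the analogous call on each `gcc` would yield the partition sizes and per-group memberships tabulated above.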